{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [kaggle][学习向]sf-crime数据的传统机器学习（其一）\n",
    "\n",
    "这是一篇专门用来灌水的部分，在这一篇里面，我们将使用传统机器学习的方法建立sf-crime的数据模型\n",
    "\n",
    "**灌水部分**，不调参，不提交，只灌水\n",
    "\n",
    "导入数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import log_loss\n",
    "\n",
    "train_features = np.load('./data/train_features.npy')\n",
    "train_labels = np.load('./data/train_labels.npy')\n",
    "test_features = np.load('./data/test_features.npy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 朴素贝叶斯"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "朴素贝叶斯的log损失为 2.582578\n"
     ]
    }
   ],
   "source": [
    "from sklearn.naive_bayes import BernoulliNB\n",
    "model = BernoulliNB()\n",
    "model.fit(train_features, train_labels)\n",
    "predicted = np.array(model.predict_proba(train_features))\n",
    "print (\"朴素贝叶斯的log损失为 %f\" % (log_loss(train_labels, predicted)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 逻辑回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\XPS\\.conda\\envs\\Pytorch-learn\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "逻辑回归的log损失为 2.540033\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "model = LogisticRegression(C=0.1)\n",
    "model.fit(train_features, train_labels)\n",
    "predicted = np.array(model.predict_proba(train_features))\n",
    "print (\"逻辑回归的log损失为 %f\" % (log_loss(train_labels, predicted)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 决策树\n",
    "\n",
    "易过拟合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "决策树的log损失为 2.543998\n"
     ]
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "model = DecisionTreeClassifier(max_depth = 5)\n",
    "model.fit(train_features, train_labels)\n",
    "predicted = np.array(model.predict_proba(train_features))\n",
    "print (\"决策树的log损失为 %f\" % (log_loss(train_labels, predicted)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 随机森林\n",
    "\n",
    "易内存不足，且有过拟合现象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "随机森林的log损失为 2.597370\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier  \n",
    "model = RandomForestClassifier(max_leaf_nodes = 5,n_estimators = 10)\n",
    "#随手写的参数\n",
    "model.fit(train_features, train_labels)\n",
    "predicted = np.array(model.predict_proba(train_features))\n",
    "print (\"随机森林的log损失为 %f\" % (log_loss(train_labels, predicted)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 梯度提升决策树\n",
    "\n",
    "以lightGBM为例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lightGBM的log损失为 2.509984\n"
     ]
    }
   ],
   "source": [
    "import lightgbm as lgb\n",
    "model = lgb.LGBMClassifier(num_leaves=35,learning_rate=0.05,n_estimators=20)\n",
    "#随手写的参数\n",
    "model.fit(train_features, train_labels)\n",
    "predicted = np.array(model.predict_proba(train_features))\n",
    "print (\"lightGBM的log损失为 %f\" % (log_loss(train_labels, predicted)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "testResult = np.array(model.predict_proba(test_features))\n",
    "sampleSubmission = pd.read_csv('../input/sf-crime/sampleSubmission.csv.zip')\n",
    "Result_pd = pd.DataFrame(testResult,\n",
    "                         index=sampleSubmission.index,\n",
    "                         columns=sampleSubmission.columns[1:])\n",
    "Result_pd.to_csv('../working/sampleSubmission(test).csv', index_label='Id')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提交"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于数据集的问题，SVM很难找到最优超平面，遂舍去。\n",
    "\n",
    "后面会使用神经网络，所以逻辑回归不继续调优。\n",
    "\n",
    "梯度提升决策树将在神经网络后面进行调优。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6.10 64-bit ('Pytorch-learn': conda)",
   "language": "python",
   "name": "python361064bitpytorchlearnconda080df47efea24539a61202fa66a72562"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
